In this project I am going to forecast the number of vehicles sold in the US for the next 12 months. The forecast is based on the USVSales dataset, which ships with the TSstudio package. This dataset is a monthly time series object starting in January 1976 and ending in September 2019. To obtain the desired forecast I am going to compare several models (mainly Holt-Winters and SARIMA).
library(forecast)
library(TSstudio)
library(plotly)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(dygraphs)
library(lubridate)
library(datasets)
library(h2o)
library(ggfortify)
library(knitr)
library(gridExtra)
library(formattable)
As previously mentioned, this dataset comes bundled with a loaded package, so we simply call the data() command. After that we use the formattable() function to display the time series as a table (here I am showing the last ten observations of the dataset).
data("USVSales")
formattable(ts_to_prophet(tail(USVSales, 10)), align = c("c", "c"), list("ds" = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold")), "y" = formatter("span", style = ~ style(color = "grey", font.weight = "bold"))))
| ds | y |
|---|---|
| 2018-12-01 | 1665.906 |
| 2019-01-01 | 1171.515 |
| 2019-02-01 | 1288.302 |
| 2019-03-01 | 1642.796 |
| 2019-04-01 | 1372.696 |
| 2019-05-01 | 1628.073 |
| 2019-06-01 | 1554.625 |
| 2019-07-01 | 1443.862 |
| 2019-08-01 | 1688.154 |
| 2019-09-01 | 1312.337 |
autoplot(USVSales, fill = "red") + ggtitle("US Monthly Vehicle Sales") + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text = element_text(size = 13), axis.title = element_text(size = 14)) +
xlab("Year") + ylab("Number of sold cars")
We can see that the series has a very clear seasonal component, as well as a non-linear trend and various cycles. Since we want to forecast the next 12 months, it makes little sense to use the whole dataset, as the older observations may introduce noise into our model. We therefore keep only the data of the last cycle (the current one):
USVSales10 <- window(USVSales, start = c(2010, 1), frequency = 12)
autoplot(USVSales10, fill = "red") + ggtitle("US Monthly Vehicle Sales (2010-2019)") + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text = element_text(size = 13), axis.title = element_text(size = 14)) +
xlab("Year") + ylab("Number of sold cars")
decompose(USVSales10) %>% plot()
In order to analyze the seasonal patterns in the data, we create a data frame with year and month columns so that we can build some ggplots. Here are the last 10 observations of the data frame we have created.
USVSales10_df <- ts_to_prophet(USVSales10)
USVSales10_df$year <- lubridate::year(USVSales10_df$ds)
USVSales10_df$month <- lubridate::month(USVSales10_df$ds)
formattable(tail(USVSales10_df, 10), align = c("c", "c", "c", "c"), list("ds" = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold")),
"y" = formatter("span", style = ~ style(color = "grey", font.weight = "bold")),
"year" = formatter("span", style = ~ style(color = "grey", font.weight = "bold")),
"month" = formatter("span", style = ~ style(color = "grey", font.weight = "bold"))))
| | ds | y | year | month |
|---|---|---|---|---|
| 108 | 2018-12-01 | 1665.906 | 2018 | 12 |
| 109 | 2019-01-01 | 1171.515 | 2019 | 1 |
| 110 | 2019-02-01 | 1288.302 | 2019 | 2 |
| 111 | 2019-03-01 | 1642.796 | 2019 | 3 |
| 112 | 2019-04-01 | 1372.696 | 2019 | 4 |
| 113 | 2019-05-01 | 1628.073 | 2019 | 5 |
| 114 | 2019-06-01 | 1554.625 | 2019 | 6 |
| 115 | 2019-07-01 | 1443.862 | 2019 | 7 |
| 116 | 2019-08-01 | 1688.154 | 2019 | 8 |
| 117 | 2019-09-01 | 1312.337 | 2019 | 9 |
ggplot(data = USVSales10_df, aes(x = y)) + geom_density(data = USVSales10_df, aes(fill = as.factor(year))) +
facet_grid(year~.) + labs(fill = "Year") + ggtitle("US Vehicle Sales density plots - by year") +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text = element_text(size = 13), axis.title = element_text(size = 14))
We can observe a clear trend: the number of cars sold increased between 2010 and 2012 and seems to have stabilized between 2013 and 2019.
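As a quick numeric check of this trend, we can compute the mean monthly sales per year (a small dplyr sketch, not part of the original analysis, reusing the USVSales10_df data frame built above):

```r
library(dplyr)

# Average monthly vehicle sales per year: a compact numeric view of
# the trend visible in the yearly density plots above.
yearly_means <- USVSales10_df %>%
  group_by(year) %>%
  summarise(mean_sales = mean(y))
yearly_means
```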
ggplot(data = USVSales10_df, aes(x = y)) + geom_density(data = USVSales10_df, aes(fill = as.factor(month))) +
facet_grid(month~.) + labs(fill = "Month (number)") + ggtitle("US Vehicle Sales Density plots - by month") +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text = element_text(size = 13), axis.title = element_text(size = 14))
Apparently we can't derive many insights from this plot: it is heavily affected by the trend and therefore does not show the true seasonal pattern. We detrend the series and redraw the same chart in order to visualize the real pattern.
USVSales10_detrended <- USVSales10 - decompose(USVSales10)$trend
USVSales10_detrended_df <- ts_to_prophet(USVSales10_detrended)
USVSales10_detrended_df$year <- lubridate::year(USVSales10_detrended_df$ds)
USVSales10_detrended_df$month <- lubridate::month(USVSales10_detrended_df$ds)
USVSales10_detrended_df <- USVSales10_detrended_df %>% filter(!is.na(y))
ggplot(data = USVSales10_detrended_df, aes(x = y)) + geom_density(data = USVSales10_detrended_df, aes(fill = as.factor(month))) +
facet_grid(month~.) + labs(fill = "Month (number)") + ggtitle("US Vehicle Sales Density plots - by Year (Detrended series)") +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text = element_text(size = 13), axis.title = element_text(size = 14))
We can see there is some monthly seasonality: sales increase between January and March, decrease between April and November, and rise again in December.
ggplot(data = USVSales10_detrended_df, aes(x = as.factor(month), y = y)) +
geom_jitter(aes(x = as.factor(month), y = y, colour = as.factor(month)),
size = 2) +
geom_boxplot(aes(fill = as.factor(month)), alpha = 0.5) +
ggtitle("Number of Sold Vehicles - Boxplot") +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text = element_text(size = 13), axis.title = element_text(size = 14),
legend.position = "none") +
xlab("Month (number)") + ylab("Number of sold vehicles")
Now we are going to analyze the correlation of the series with its previous lags.
par(mfrow = c(1,2))
acf(USVSales10, lag.max = 60)
pacf(USVSales10, lag.max = 60)
ts_lags(USVSales10)
Since we want to forecast the following 12 observations, we split our time series into a training partition with n - 12 observations (where n is the total number of observations) and a testing partition with the last 12 observations.
USVSales10_division <- ts_split(USVSales10, sample.out = 12)
train <- USVSales10_division$train
test <- USVSales10_division$test
Holt-Winters is a forecasting model that depends on three smoothing parameters (alpha, beta and gamma), which act as the weights of an advanced moving-average technique. In order to find the best parameters we use a grid search: we train the Holt-Winters model with different combinations of parameters (in our case iterating over values from 0 to 1 in steps of 0.1) on several training partitions (six in our case) and test each one (in our case on 12 observations). For each combination we obtain an error metric on the testing partition, which tells us how well the model fits the data (we rely on the MAPE, Mean Absolute Percentage Error), and we select the combination of parameters with the minimum error. To do this, we use the ts_grid() function, which performs the grid search almost automatically.
shallow_grid <- ts_grid(train,
model = "HoltWinters",
periods = 6,
window_space = 6,
window_test = 12,
hyper_params = list(alpha = seq(0, 1, 0.1),
beta = seq(0, 1, 0.1),
gamma = seq(0, 1, 0.1)),
parallel = TRUE)
grid_df_2 <- shallow_grid$grid_df[,-4:-9]
names(grid_df_2) <- c("alpha", "beta", "gamma", "MAPE")
formattable(head(grid_df_2, 10), align = c("c", "c", "c", "c"), list("alpha" = formatter(
"span", style = ~ style(color = "grey", font.weight = "bold")),
"beta" = formatter("span", style = ~ style(color = "grey", font.weight = "bold")),
"gamma" = formatter("span", style = ~ style(color = "grey", font.weight = "bold")),
"MAPE" = formatter("span", style = ~ style(color = "grey", font.weight = "bold"))))
| alpha | beta | gamma | MAPE |
|---|---|---|---|
| 0.1 | 0.2 | 1.0 | 4.049828 |
|  | 0.2 | 0.9 | 4.063146 |
|  | 0.3 | 0.6 | 4.104964 |
|  | 0.2 | 0.8 | 4.110752 |
|  | 0.3 | 0.7 | 4.115662 |
|  | 0.3 | 0.8 | 4.139951 |
|  | 0.4 | 0.4 | 4.147130 |
|  | 0.3 | 0.9 | 4.162940 |
|  | 0.4 | 0.5 | 4.166999 |
|  | 0.2 | 0.7 | 4.170236 |
The best model seems to be the Holt-Winters with alpha = 0.1, beta = 0.2 and gamma = 1, since it gives the lowest MAPE (the MAPE column in the table).
We can also visualize the results in a 3D plot, with alpha, beta and gamma on the three axes and the colour representing the MAPE of each combination.
plot_grid(shallow_grid, type = "3D")
md_hw <- HoltWinters(train,
alpha = shallow_grid$alpha,
beta = shallow_grid$beta,
gamma = shallow_grid$gamma)
fc_hw <- forecast(md_hw, h = 12)
accuracy(fc_hw, test)
## ME RMSE MAE MPE MAPE MASE
## Training set -4.149397 67.39791 53.76542 -0.3759086 3.947348 0.6298539
## Test set -8.447457 73.25790 48.82061 -0.8642598 3.358505 0.5719261
## ACF1 Theil's U
## Training set 0.02499219 NA
## Test set -0.24310794 0.2835323
test_forecast(USVSales10,
forecast.obj = fc_hw,
test = test)
checkresiduals(md_hw)
par(mfrow = c(1,2))
acf(USVSales10, lag.max = 60)
pacf(USVSales10, lag.max = 60)
If we take a look at the series, we can see that it is not stationary (its mean and variance are not stable over time):
autoplot(USVSales10, fill = "red") + ggtitle("US Monthly Vehicle Sales") + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text = element_text(size = 13), axis.title = element_text(size = 14)) +
xlab("Year") + ylab("Number of sold cars") + xlim(as.Date(c("2010-01-01", "2019-09-01")))
In order to stabilize the mean, we difference the series with respect to its first non-seasonal lag (lag 1).
USVSales10_d1 <- diff(USVSales10, 1)
par(mfrow = c(1,2))
acf(USVSales10_d1, lag.max = 60)
pacf(USVSales10_d1, lag.max = 60)
- There is still a lot of linear decay in the ACF, indicating that the trend has not been fully removed.
- There is strong correlation with the first non-seasonal and seasonal lags.
If we take a look at the series, we can see that its mean has stabilized, but its variance is still not stable.
autoplot(USVSales10_d1, fill = "red") + ggtitle("US Monthly Vehicle Sales - First order non-seasonal difference") + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text = element_text(size = 13), axis.title = element_text(size = 14)) +
xlab("Year") + ylab("Number of sold cars") + xlim(as.Date(c("2010-01-01", "2019-09-01")))
Since the series is highly correlated with its first seasonal lag, we also difference it with respect to the 12th lag.
USVSales10_d1_d12 <- diff(USVSales10_d1, 12)
par(mfrow = c(1,2))
acf(USVSales10_d1_d12, lag.max = 60)
pacf(USVSales10_d1_d12, lag.max = 60)
If we take a look at the series, we can see that its mean and variance are now fairly constant over time: we have turned the series into a stationary process.
autoplot(USVSales10_d1_d12, fill = "red") + ggtitle("US Monthly Vehicle Sales - First order seasonal and non-seasonal difference") +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text = element_text(size = 13), axis.title = element_text(size = 14)) +
xlab("Year") + ylab("Number of sold cars")
We create a function that trains SARIMA models on the training partition using all possible combinations of p, d, q, P, D, Q (fixing d = 1 and D = 1), returns their AIC (Akaike's Information Criterion), and identifies the models with the lowest AIC (the best-performing models).
p <- 0:2
q <- 0:2
P <- 0:2
Q <- 0:2
d <- D <- 1
arima_grid <- expand.grid(p,d,q,P,D,Q)
names(arima_grid) <- c("p", "d", "q", "P", "D", "Q")
arima_grid$k <- rowSums(arima_grid)
arima_grid <- filter(arima_grid, k <= 6)
arima_search2 <- function(x){
  for(i in 1:nrow(x)){
    md <- arima(train, order = c(x$p[i], 1, x$q[i]),
                seasonal = list(order = c(x$P[i], 1, x$Q[i])))
    x$AIC[i] <- md$aic
  }
  # Sort once, after all models have been fitted; sorting inside the
  # loop would reorder the rows while we are still indexing into them
  x %>% arrange(AIC)
}
best_arimas <- arima_search2(arima_grid)
best_arimas <- best_arimas[,-7]
formattable(head(best_arimas, 10), align = c("c", "c", "c", "c", "c", "c", "c"), list("p" = formatter(
"span", style = ~ style(color = "grey", font.weight = "bold")),
"d" = formatter("span", style = ~ style(color = "grey", font.weight = "bold")),
"q" = formatter("span", style = ~ style(color = "grey", font.weight = "bold")),
"P" = formatter("span", style = ~ style(color = "grey", font.weight = "bold")),
"D" = formatter("span", style = ~ style(color = "grey", font.weight = "bold")),
"Q" = formatter("span", style = ~ style(color = "grey", font.weight = "bold")),
"AIC" = formatter("span", style = ~ style(color = "grey", font.weight = "bold"))))
| p | d | q | P | D | Q | AIC |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2 | 1 | 1 | 1040.113 |
|  | 1 | 1 | 0 | 1 | 2 | 1042.495 |
|  | 1 | 1 | 1 | 1 | 2 | 1042.743 |
|  | 1 | 1 | 1 | 1 | 1 | 1043.544 |
|  | 1 | 1 | 0 | 1 | 2 | 1044.493 |
|  | 1 | 2 | 0 | 1 | 2 | 1044.493 |
|  | 1 | 2 | 1 | 1 | 1 | 1045.382 |
|  | 1 | 0 | 0 | 1 | 2 | 1046.596 |
|  | 1 | 0 | 1 | 1 | 1 | 1046.840 |
| 0 | 1 | 1 | 0 | 1 | 0 | 1049.633 |
We can see that the best model seems to be a SARIMA(0,1,1)(2,1,1) with an AIC of 1040.113.
md_arima1 <- arima(train, order = c(0,1,1), seasonal = list(order = c(2,1,1)))
fc_arima1 <- forecast(md_arima1, h = 12)
accuracy(fc_arima1, test)
## ME RMSE MAE MPE MAPE MASE
## Training set -9.397901 53.64805 38.08999 -0.822473 2.847185 0.4462185
## Test set -44.087594 78.97211 68.62879 -3.306884 4.787310 0.8039760
## ACF1 Theil's U
## Training set -0.02532106 NA
## Test set -0.08888482 0.3079181
test_forecast(USVSales10,
forecast.obj = fc_arima1,
test = test)
checkresiduals(md_arima1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(2,1,1)[12]
## Q* = 28.802, df = 17, p-value = 0.03639
##
## Model df: 4. Total lags used: 21
Now we try to fit the SARIMA model suggested by the auto.arima() function and compare it to our SARIMA model.
md_arima_auto <- auto.arima(train)
md_arima_auto
## Series: train
## ARIMA(0,1,1)(0,1,0)[12]
##
## Coefficients:
## ma1
## -0.8276
## s.e. 0.0690
##
## sigma^2 estimated as 5045: log likelihood=-522.82
## AIC=1049.63 AICc=1049.77 BIC=1054.68
The function recommends a SARIMA(0,1,1)(0,1,0) with an AIC of 1049.63 (higher than our model's AIC).
fc_arima_auto <- forecast(md_arima_auto, h = 12)
accuracy(fc_arima_auto, test)
## ME RMSE MAE MPE MAPE MASE
## Training set -7.949931 66.12178 47.57403 -0.7180684 3.521087 0.5573227
## Test set -2.245018 70.64598 46.28775 -0.4052621 3.139064 0.5422540
## ACF1 Theil's U
## Training set -0.01677752 NA
## Test set -0.30000047 0.2698991
test_forecast(USVSales10,
forecast.obj = fc_arima_auto,
test = test)
checkresiduals(md_arima_auto)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(0,1,0)[12]
## Q* = 34.014, df = 20, p-value = 0.02603
##
## Model df: 1. Total lags used: 21
The two best candidates are the Holt-Winters model and our SARIMA(0,1,1)(2,1,1). Since they are different kinds of models, we compare them based on their error scores (the differences between the forecasted test values and the real test values) and on their residuals.
accuracy(fc_hw, test)
## ME RMSE MAE MPE MAPE MASE
## Training set -4.149397 67.39791 53.76542 -0.3759086 3.947348 0.6298539
## Test set -8.447457 73.25790 48.82061 -0.8642598 3.358505 0.5719261
## ACF1 Theil's U
## Training set 0.02499219 NA
## Test set -0.24310794 0.2835323
accuracy(fc_arima1, test)
## ME RMSE MAE MPE MAPE MASE
## Training set -9.397901 53.64805 38.08999 -0.822473 2.847185 0.4462185
## Test set -44.087594 78.97211 68.62879 -3.306884 4.787310 0.8039760
## ACF1 Theil's U
## Training set -0.02532106 NA
## Test set -0.08888482 0.3079181
The Holt-Winters model has lower error rates on the test set than the SARIMA model.
checkresiduals(md_hw)
checkresiduals(md_arima1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(2,1,1)[12]
## Q* = 28.802, df = 17, p-value = 0.03639
##
## Model df: 4. Total lags used: 21
The Holt-Winters residuals seem more normally distributed than the ARIMA residuals. Neither set of residuals is white noise, but both appear fairly independent.
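The normality impression can be quantified with a Shapiro-Wilk test on each set of residuals (a sketch, not part of the original analysis; a small p-value argues against normality):

```r
# Shapiro-Wilk normality test; H0: the residuals are normally
# distributed. residuals() works on both fitted objects.
shapiro.test(residuals(md_hw))     # Holt-Winters residuals
shapiro.test(residuals(md_arima1)) # SARIMA residuals
```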
Since there is not much difference between their residuals (neither set of residuals is white noise), I have focused on the error metrics to select the best model. The model with the lowest error is the Holt-Winters, so I will use it as the final model for the forecast.
final_md_hw <- HoltWinters(USVSales10,alpha = shallow_grid$alpha,
beta = shallow_grid$beta,
gamma = shallow_grid$gamma)
final_fc_hw <- forecast(final_md_hw, h = 12)
autoplot(final_fc_hw, size = 0.65, fill = "black", conf.int.alpha = 0.15, conf.int.fill = "red", predict.colour = "red") +
ggtitle("US Vehicle Sales - Forecast using Holt-Winters") +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text = element_text(size = 13), axis.title = element_text(size = 14)) +
xlab("Year") + ylab("Number of sold cars")
autoplot(final_fc_hw, size = 0.65, fill = "black", predict.colour = "red", conf.int.alpha = 0.15, conf.int.fill = "red") +
theme_get() + ggtitle("US Vehicle Sales - Forecast using Holt-Winters (zoomed)") + xlab("Year") + ylab("Number of sold Vehicles")+
xlim(as.Date(c("2016-01-01", "2020-09-01"))) + ylim(c(1000,1800)) +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text = element_text(size = 13), axis.title = element_text(size = 14)) +
xlab("Year") + ylab("Number of sold cars")